Magnitude is an open source Python package developed by Ajay Patel and Alexander Sands (Patel and Sands, 2018). It provides a full set of features and a new vector storage file format that make it possible to use vector embeddings in a fast, efficient, and simple manner. It is intended to be a simpler and faster alternative to existing utilities for word vectors, such as Gensim (Řehůřek and Sojka, 2010).
Magnitude’s file format (“.magnitude”) is an efficient universal vector embedding format. The Magnitude library implements on-demand lazy loading for faster file loading, caching for better performance of repeated queries, and fast processing of bulk key queries. Table 1 gives speed benchmark comparisons between Magnitude and Gensim for various operations on the Google News pre-trained word2vec model (Mikolov et al., 2013). Loading the binary file containing the word vectors takes Gensim 70 seconds, versus 0.72 seconds to load the corresponding Magnitude file, a 97x speed-up. Gensim uses 5GB of RAM, versus 18KB for Magnitude.
Magnitude implements functions for looking up vector representations for misspelled or out-of-vocabulary words, quantization of vector models, exact and approximate similarity search, concatenating multiple vector models together, and manipulating models that are larger than a computer’s main memory. Magnitude’s ease of use and simple interface combined with its speed, efficiency, and novel features make it an excellent tool for cases ranging from applications used in production environments to academic research to students in natural language processing courses.
| Operation | Cold | Warm |
| --- | --- | --- |
| Initial load time | 97x | – |
| Single key query | 1x | 110x |
| Multiple key query (n=25) | 68x | 3x |
| k-NN search query (k=10) | 1x | 5,935x |

Table 1: Speed-up of Magnitude relative to Gensim for initial (cold) and subsequent (warm) operations.
Magnitude offers solutions to a number of problems with current utilities.
Existing utilities are prohibitively slow for iterative development. Many projects use Gensim to load the Google News word2vec model directly from a “.bin” or “.txt” file multiple times; loading the file can take between one minute and a minute and a half.
A production web server will run multiple processes to serve requests. Running Gensim in this configuration will consume 4GB of RAM per process.
Many developers duplicate effort by re-writing commonly needed routines that current utilities do not provide: routines for concatenating embeddings, bulk key lookup, out-of-vocabulary search, and building indexes for approximate k-nearest neighbors.
The Magnitude library builds on several well-engineered libraries to achieve its performance improvements. It uses SQLite (https://www.sqlite.org/) as its underlying data store, taking advantage of database indexes for fast key lookups and of memory mapping. It uses NumPy (http://www.numpy.org/) to achieve significant speed-ups over native Python code through computations that follow the Single Instruction, Multiple Data (SIMD) paradigm. It uses spatial indexes to perform fast exact similarity search, and Annoy (https://github.com/spotify/annoy) to perform approximate k-nearest neighbors search in the vector space. To perform feature hashing, it uses xxHash (https://xxhash.org/), an extremely fast non-cryptographic hash algorithm that works at speeds close to RAM limits. Magnitude’s file format uses LZ4 compression (http://www.lz4.org/) for compact storage.
3 Design Principles
Several design principles guided the development of the Magnitude library:
- The API should be intuitive and beginner friendly. It should have sensible defaults instead of requiring configuration choices by the user. The option to configure every setting should still be provided to power users.
- The out-of-the-box configuration should be fast and memory efficient for iterative development, and suitable for deployment in a production environment. Using the same configuration in development and production reduces bugs and makes deployment easier.
- The library should use lazy loading whenever possible to remain fast, responsive, and memory efficient during development.
- The library should aggressively index, cache, and use memory maps to be fast, responsive, and memory efficient in production.
- The library should be able to process data that is too large to fit into a computer’s main memory.
- The library should be thread-safe and employ memory mapping to reduce duplicated memory resources when multiprocessing.
Gensim offers several speed-ups of its operations, but these are largely only accessible through advanced configuration, for example by re-exporting a “.bin”, “.txt”, or “.vec” file into its own native format, which can then be memory-mapped. Magnitude makes this easier by providing a default configuration and file format that require no extra configuration to make development and production workloads run efficiently out of the box.
4 Getting Started with Magnitude
The system consists of a Python 2.7 and Python 3.x compatible package, accessible through the PyPI index (https://pypi.org/project/pymagnitude/) or GitHub (https://github.com/plasticityai/magnitude), with utilities for using the “.magnitude” format and for converting to it from other popular embedding formats.
4.1 Installation
Installation for Python 2.7 can be performed using the pip command:
Installation for Python 3.x can be performed using the pip3 command:
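Both commands install the same package; the package name, pymagnitude, matches the PyPI URL above:

```shell
# Python 2.7
pip install pymagnitude

# Python 3.x
pip3 install pymagnitude
```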
4.2 Basic Usage
Here is how to construct the Magnitude object, query for vectors, and compare them:
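A minimal sketch of this usage (it assumes pymagnitude is installed and that a pre-converted “.magnitude” model has already been downloaded; the file path is illustrative):

```python
from pymagnitude import Magnitude

# Open a pre-converted model; vectors are lazily loaded from disk,
# so this returns in well under a second even for large models.
vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")

v = vectors.query("cat")                 # NumPy vector for a single key
sim = vectors.similarity("cat", "dog")   # similarity between two keys
dist = vectors.distance("cat", "dog")    # distance between two keys
```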
Magnitude queries return almost instantly and are memory efficient: vectors are lazily loaded directly from disk instead of loading the entire model into memory. Additionally, Magnitude supports nearest neighbors operations, finding all words that are closer to a key than another key, and analogy solving (optionally with Levy and Goldberg (2014)’s 3CosMul function):
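For example (again assuming a locally downloaded model; the keys are illustrative):

```python
from pymagnitude import Magnitude

vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")

# Nearest neighbors of a key
vectors.most_similar("cat", topn=10)

# All keys that are closer to "cat" than "rabbit" is
vectors.closer_than("cat", "rabbit")

# Analogy solving: king - man + woman ~ queen
vectors.most_similar(positive=["woman", "king"], negative=["man"])

# The same analogy using the 3CosMul function
vectors.most_similar_cosmul(positive=["woman", "king"], negative=["man"])
```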
In addition to querying single words, Magnitude also makes it easy to query for multiple words in a single sentence and multiple sentences:
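For instance (illustrative model path and sentences):

```python
from pymagnitude import Magnitude

vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")

# One sentence: returns a matrix with one row per word
vectors.query(["play", "some", "music"])

# Multiple sentences: returns one matrix per sentence
vectors.query([["play", "some", "music"],
               ["turn", "the", "lights", "down"]])
```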
4.3 Advanced Features
Magnitude implements a novel method for handling out-of-vocabulary (OOV) words. OOVs frequently occur in real world data, since pre-trained models are often missing slang, colloquialisms, new product names, or misspellings. For example, while uber exists in Google News word2vec, uberx and uberxl do not; these products were not available when the Google News corpus was built. Strategies for representing these words include generating random unit-length vectors for each unknown word, or mapping all unknown words to a token like “UNK” and representing them with the same vector. These solutions are not ideal, as the embeddings will not capture semantic information about the actual word. Using Magnitude, these OOV words can simply be queried and will be positioned in the vector space close to other OOV words based on their string similarity:
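For example (assuming a locally downloaded model; the path is illustrative):

```python
from pymagnitude import Magnitude

vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")

"uberx" in vectors                      # False: out of vocabulary...
vectors.query("uberx")                  # ...but a vector is still generated
vectors.similarity("uberx", "uberxl")   # high, due to string similarity
```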
A consequence of generating OOV vectors is that misspellings and typos are also sensibly handled:
The OOV handling is detailed in Section 5.
Concatenation of Multiple Models:
Magnitude makes it easy to concatenate multiple types of vector embeddings to create combined models.
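For example, a word2vec model and a GloVe model can be combined so that queries return the concatenation of both vectors (file paths are illustrative):

```python
from pymagnitude import Magnitude

word2vec = Magnitude("GoogleNews-vectors-negative300.magnitude")
glove = Magnitude("glove.6B.300d.magnitude")

# Passing multiple Magnitude objects creates a combined model
vectors = Magnitude(word2vec, glove)
vectors.query("cat")  # concatenation of the word2vec and GloVe vectors
```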
Adding Features for Part-of-Speech Tags and Syntax Dependencies to Vectors:
Magnitude can directly turn a set of keys (like a POS tag set) into vectors. Given an approximate upper bound on the number of keys and a namespace, it uses the hashing trick (Weinberger et al., 2009) to create an appropriately sized dimension for the keys.
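The hashing trick can be sketched in a few lines. This is an illustrative stand-alone implementation, not Magnitude's internal code: the helper name hash_to_vector is hypothetical, and hashlib stands in for the xxHash function Magnitude actually uses.

```python
import hashlib

import numpy as np

def hash_to_vector(key, dims, namespace=""):
    # Deterministically map an arbitrary key (e.g. a POS tag) to an index
    # in a fixed-dimension vector, so any tag set gets a vector without
    # storing an explicit vocabulary.
    digest = hashlib.md5((namespace + ":" + key).encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % dims
    vec = np.zeros(dims, dtype=np.float32)
    vec[index] = 1.0
    return vec

pos_dims = 32  # approximate upper bound on the number of POS tags
v_noun = hash_to_vector("NOUN", pos_dims, namespace="pos")
v_verb = hash_to_vector("VERB", pos_dims, namespace="pos")
```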
This can be used with Magnitude’s concatenation feature to combine the vectors for words with the vectors for POS tags or dependency tags. Homonyms show why this may be useful:
We support approximate similarity search with the most_similar_approx function, which finds approximate nearest neighbors more quickly than the exact search performed by the most_similar function. The method accepts an effort argument in the range 0.0 to 1.0. A lower effort reduces accuracy but increases speed; a higher effort does the reverse. This trade-off works by searching more or fewer of the trees in the index. Our approximate k-NN is powered by Annoy, an open source library released by Spotify. Table 2 compares the speed of various configurations for similarity search.
| Configuration | Time per query |
| --- | --- |
| Approx. k-NN (k=10, effort = 1.0) | 0.1873s |
| Approx. k-NN (k=10, effort = 0.1) | 0.0199s |

Table 2: Approximate similarity search speed at different effort settings.
5 Details of OOV Handling
Facebook’s fastText (Bojanowski et al., 2016) provides OOV functionality similar to Magnitude’s. Magnitude allows OOV lookups for any embedding model, including older models like word2vec and GloVe (Mikolov et al., 2013; Pennington et al., 2014), which did not provide OOV support. Magnitude’s OOV method can be used with existing embeddings because, unlike fastText’s method, it does not require any changes at training time. For ELMo vectors, Magnitude uses ELMo’s own OOV method.
Constructing vectors from character n-grams:
We generate a vector for an OOV word w based on the character n-gram sequences in the word. First, we pad the word with a start-of-word character at the beginning and an end-of-word character at the end. Next, we generate the set of all character n-grams in the padded word (denoted by the function CGRAMS(w)) between length 3 and 6, following Bojanowski et al. (2016), although these parameters are tunable arguments in the Magnitude converter. We use the set of character n-grams to construct a vector with d dimensions to represent the word w. Each unique character n-gram from the word contributes to the vector through a pseudorandom vector generator function PRVG. Finally, the vector is normalized.
PRVG’s random number generator is seeded by the value “seed” and generates uniformly random vectors of dimension d, with values in the range of -1 to 1. The hashing function H produces a 32-bit hash of its input using xxHash, and each n-gram’s hash serves as a seed for PRVG. Since the PRVG’s seeds are conditioned only upon the word w, the output is deterministic across different machines.
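The procedure above can be sketched end-to-end. This is an illustrative re-implementation under stated assumptions: “<” and “>” as the padding characters, hashlib.md5 truncated to 32 bits standing in for xxHash, and NumPy's RandomState as the seeded generator; none of these choices are claimed to match Magnitude's internals exactly.

```python
import hashlib

import numpy as np

def cgrams(word, nmin=3, nmax=6):
    # All character n-grams of the padded word, lengths 3 through 6.
    padded = "<" + word + ">"
    return {padded[i:i + n]
            for n in range(nmin, nmax + 1)
            for i in range(len(padded) - n + 1)}

def H(s):
    # 32-bit hash of the input (md5 substitutes for xxHash here so the
    # sketch needs only the standard library).
    return int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:4], "big")

def prvg(seed, d):
    # Pseudorandom vector generator: uniform values in [-1, 1], fully
    # determined by the seed, so results match across machines.
    return np.random.RandomState(seed).uniform(-1.0, 1.0, d)

def oov_vector(word, d=300):
    # Each unique n-gram contributes a pseudorandom vector; the sum is
    # normalized to unit length.
    v = np.sum([prvg(H(c), d) for c in sorted(cgrams(word))], axis=0)
    return v / np.linalg.norm(v)

u1 = oov_vector("uberx")
u2 = oov_vector("uberxl")
```

Because uberx and uberxl share most of their character n-grams, the sketched vectors come out far more similar to each other than to an unrelated word, mirroring the behavior described above.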
This character n-gram-based method will generate highly similar vectors for a pair of OOVs with similar spellings, like uberx and uberxl. However, they will not be embedded close to similar in-vocabulary words like uber.
Interpolation with in-vocabulary words:
To match OOVs to in-vocabulary words, we first define a function MATCH_k(w), which returns the normalized mean of the vectors of the top k most string-similar in-vocabulary words, found using the full-text SQLite index. In practice, we use the top 3 most string-similar words. These are then used to interpolate the values for the vector representing the OOV word: 30% of the weight for each value comes from the pseudorandom vector generator based on the OOV’s n-grams, and the remaining 70% comes from the values of the 3 most string-similar in-vocabulary words.
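The interpolation step can be sketched as follows. This is illustrative code: MATCH_3 is approximated by taking three pre-supplied neighbor vectors rather than running the actual string-similarity search, and the helper name is hypothetical.

```python
import numpy as np

def interpolate_oov(ngram_vec, neighbor_vecs):
    # MATCH_k: normalized mean of the k most string-similar
    # in-vocabulary vectors (here simply passed in).
    match = np.mean(neighbor_vecs, axis=0)
    match = match / np.linalg.norm(match)
    # 30% weight from the n-gram-based pseudorandom vector,
    # 70% from the string-similarity matches; re-normalize the result.
    v = 0.3 * ngram_vec + 0.7 * match
    return v / np.linalg.norm(v)

rng = np.random.RandomState(0)
ngram_vec = rng.uniform(-1, 1, 50)
ngram_vec /= np.linalg.norm(ngram_vec)
neighbors = [rng.uniform(-1, 1, 50) for _ in range(3)]
v = interpolate_oov(ngram_vec, neighbors)
```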
For English, we have implemented a nuanced string similarity metric that is prefix- and suffix-aware. While uberification has a high string similarity to verification and has a lower string similarity to uber, good OOV vectors should weight stems more heavily than suffixes. Details of our morphology-aware matching are omitted for space.
Other matching nuances
We employ other techniques when computing the string similarity metric, such as shrinking repeated character sequences of three or more characters to two (hiiiiiiii → hii), ranking strings of a similar length higher, and, for shorter words, ranking strings that share the same first or last character higher.
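The repeated-character rule, for instance, can be sketched with a regular expression (an illustrative helper, not Magnitude's internal code):

```python
import re

def shrink_repeats(s):
    # Collapse runs of three or more identical characters down to two,
    # e.g. "hiiiiiiii" -> "hii".
    return re.sub(r"(.)\1{2,}", r"\1\1", s)
```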
6 File Format
To provide efficiency at runtime, Magnitude uses a custom “.magnitude” file format instead of the “.bin”, “.txt”, “.vec”, and “.hdf5” formats that word2vec, GloVe, fastText, and ELMo use (Mikolov et al., 2013; Pennington et al., 2014; Joulin et al., 2016; Peters et al., 2018). The “.magnitude” file is a SQLite database file. There are three variants of the file format: Light, Medium, and Heavy. Heavy models have the largest file size but support all of the Magnitude library’s features. Medium models support all features except approximate similarity search. Light models support neither approximate similarity search nor interpolated OOV lookups, but they still support basic OOV lookups. See Figure 1 for more information about the structure and layout of the “.magnitude” format.
The software includes a command-line converter utility for converting word2vec (“.bin”, “.txt”), GloVe (“.txt”), fastText (“.vec”), or ELMo (“.hdf5”) files to Magnitude files. They can be converted with the command:
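The converter is invoked as a Python module; for example (file paths are illustrative):

```shell
python -m pymagnitude.converter \
    -i "./GoogleNews-vectors-negative300.bin" \
    -o "./GoogleNews-vectors-negative300.magnitude"
```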
The input format is automatically determined from the extension and the contents of the input file. During conversion, the vectors are also unit-length normalized. This conversion process only needs to be completed once per model. After conversion, the Magnitude file is static; it is never modified or written to, which makes concurrent read access safe.
By default, the converter builds a Medium “.magnitude” file. Passing the -s flag turns off encoding of subword information, resulting in a Light file. Passing the -a flag turns on building the Annoy approximate similarity index, resulting in a Heavy file. Refer to the documentation (https://github.com/plasticityai/magnitude#file-format-and-converter) for more information about conversion configuration options.
The converter utility accepts a -p <PRECISION> flag to specify the decimal precision to retain. Since underlying values are stored as integers instead of floats, this is essentially quantization (https://www.tensorflow.org/performance/quantization) for smaller model footprints. Lower decimal precision will create smaller files, because SQLite can store integers with either 1, 2, 3, 4, 6, or 8 bytes (https://www.sqlite.org/datatype3.html). Regardless of the precision selected, the library will create numpy.float32 vectors. The datatype can be changed by passing dtype=numpy.float16 to the Magnitude constructor.
Magnitude is a new open source Python library and file format for vector embeddings. It makes it easy to integrate embeddings into applications and provides a single interface and configuration that is suitable for both development and production workloads. The library and file format also enable novel features like OOV handling that allow models to be more robust to noisy data. The simple interface, ease of use, and speed of the library, compared to other utilities like Gensim, will enable use by beginners to NLP and individuals in educational environments, such as university NLP and AI courses.
Pre-trained word embeddings have been widely adopted in NLP. Researchers in computer vision have started using pre-trained vector embedding models like Deep1B Babenko and Lempitsky (2016) for images. The Magnitude library intends to stay agnostic to various domains, instead providing a generic key-vector store and interface that is useful for all domains and for research that crosses the boundaries between NLP and vision Hewitt et al. (2018).
8 Software and Data
We release the Magnitude package under the permissive MIT open source license. The full source code and pre-converted “.magnitude” models are on GitHub. The full documentation for all classes, methods, and configurations of the library can be found at https://github.com/plasticityai/magnitude, along with example usage and tutorials.
We have pre-converted several popular embedding models (Google News word2vec, Stanford GloVe, Facebook fastText, AI2 ELMo) to “.magnitude” in all its variants (Light, Medium, and Heavy). You can download them from https://github.com/plasticityai/magnitude#pre-converted-magnitude-formats-of-popular-embeddings-models.
We would like to thank Erik Bernhardsson for the useful feedback on integrating Annoy indexing into Magnitude and thank the numerous contributors who have opened issues, reported bugs, or suggested technical enhancements for Magnitude on GitHub.
This material is funded in part by DARPA under grant number HR0011-15-C-0115 (the LORELEI program) and by NSF SBIR Award #IIP-1820240. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA, the NSF, and the U.S. Government.
- Babenko and Lempitsky (2016) Artem Babenko and Victor Lempitsky. 2016. Efficient Indexing of Billion-Scale Datasets of Deep Descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2055–2063, Las Vegas, NV.
- Bojanowski et al. (2016) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. CoRR, abs/1607.04606.
- Hewitt et al. (2018) John Hewitt, Daphne Ippolito, Brendan Callahan, Reno Kriz, Derry Tanti Wijaya, and Chris Callison-Burch. 2018. Learning Translations via Images with a Massively Multilingual Image Dataset. In Proceedings of ACL, pages 2566–2576, Melbourne, Australia.
- Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of Tricks for Efficient Text Classification. CoRR, abs/1607.01759.
- Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In Proceedings of CoNLL, pages 171–180, Ann Arbor, Michigan.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1301.3781.
- Patel and Sands (2018) Ajay Patel and Alex Sands. 2018. plasticityai/magnitude: Release 0.1.22. https://doi.org/10.5281/zenodo.1255637.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543, Doha, Qatar.
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. CoRR, abs/1802.05365.
- Řehůřek and Sojka (2010) Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta.
- Weinberger et al. (2009) Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. Feature Hashing for Large Scale Multitask Learning. In Proceedings of ICML, pages 1113–1120, New York, NY.
Appendix A Benchmark Comparisons
All benchmarks (https://github.com/plasticityai/magnitude/blob/master/tests/benchmark.py) were performed on the Google News pre-trained word vectors: “GoogleNews-vectors-negative300.bin” (Mikolov et al., 2013) for Gensim, and “GoogleNews-vectors-negative300.magnitude” (http://magnitude.plasticity.ai/word2vec+approx/GoogleNews-vectors-negative300.magnitude) for Magnitude. Benchmarks were run on a MacBook Pro (Retina, 15-inch, Mid 2014) with a 2.2GHz quad-core Intel Core i7 and 16GB of RAM on an SSD, averaged over repeated trials where feasible. We explicitly do not use Gensim’s memory-mapped native format, as it requires extra configuration from the developer and is not provided out of the box from Gensim’s data repository (https://github.com/RaRe-Technologies/gensim-data).