Magnitude: A Fast, Efficient Universal Vector Embedding Utility Package

10/26/2018 · Ajay Patel, et al. · Plasticity Inc. and University of Pennsylvania

Vector space embedding models like word2vec, GloVe, fastText, and ELMo are extremely popular representations in natural language processing (NLP) applications. We present Magnitude, a fast, lightweight tool for utilizing and processing embeddings. Magnitude is an open source Python package with a compact vector storage file format that allows for efficient manipulation of huge numbers of embeddings. Magnitude performs common operations 60 to 6,000 times faster than Gensim and introduces several novel features for improved robustness, like out-of-vocabulary lookups.

1 Introduction

Magnitude is an open source Python package developed by Ajay Patel and Alexander Sands (Patel and Sands, 2018). It provides a full set of features and a new vector storage file format that make it possible to use vector embeddings in a fast, efficient, and simple manner. It is intended to be a simpler and faster alternative to current utilities for word vectors like Gensim (Řehůřek and Sojka, 2010).

Magnitude’s file format (“.magnitude”) is an efficient universal vector embedding format. The Magnitude library implements on-demand lazy loading for faster file loading, caching for better performance of repeated queries, and fast processing of bulk key queries. Table 1 gives speed benchmark comparisons between Magnitude and Gensim for various operations on the Google News pre-trained word2vec model Mikolov et al. (2013). Loading the binary files containing the word vectors takes Gensim 70 seconds, versus 0.72 seconds to load the corresponding Magnitude file, a 97x speed-up. Gensim uses 5GB of RAM versus 18KB for Magnitude.

Magnitude implements functions for looking up vector representations for misspelled or out-of-vocabulary words, quantization of vector models, exact and approximate similarity search, concatenating multiple vector models together, and manipulating models that are larger than a computer’s main memory. Magnitude’s ease of use and simple interface combined with its speed, efficiency, and novel features make it an excellent tool for cases ranging from applications used in production environments to academic research to students in natural language processing courses.

Metric                       Cold    Warm
Initial load time            97x     —
Single key query             1x      110x
Multiple key query (n=25)    68x     3x
k-NN search query (k=10)     1x      5,935x
Table 1: Speed comparison of Magnitude versus Gensim for common operations. The ‘cold’ column represents the first time the operation is called. The ‘warm’ column indicates a subsequent call with the same keys.

2 Motivation

Magnitude offers solutions to a number of problems with current utilities.

Speed:

Existing utilities are prohibitively slow for iterative development. Many projects use Gensim to load the Google News word2vec model directly from a “.bin” or “.txt” file multiple times, and each load can take between a minute and a minute and a half.

Memory:

A production web server will run multiple processes for serving requests. In that configuration, Gensim will consume roughly 4GB of RAM per process.

Code duplication:

Many developers duplicate effort by writing commonly needed routines that current utilities do not provide: concatenating embeddings, bulk key lookup, out-of-vocabulary search, and building indexes for approximate k-nearest neighbors.

The Magnitude library uses several well-engineered libraries to achieve its performance improvements. It uses SQLite (https://www.sqlite.org/) as its underlying data store, taking advantage of database indexes for fast key lookups and of memory mapping. It uses NumPy (http://www.numpy.org/) to achieve significant speedups over native Python code through computations that follow the Single Instruction, Multiple Data (SIMD) paradigm. It uses spatial indexes to perform fast exact similarity search and Annoy (https://github.com/spotify/annoy) to perform approximate k-nearest neighbors search in the vector space. To perform feature hashing, it uses xxHash (https://xxhash.org/), an extremely fast non-cryptographic hash algorithm that works at speeds close to RAM limits. Magnitude’s file format uses LZ4 compression (http://www.lz4.org/) for compact storage.

3 Design Principles

Several design principles guided the development of the Magnitude library:

  • The API should be intuitive and beginner friendly. It should have sensible defaults instead of requiring configuration choices by the user. The option to configure every setting should still be provided to power users.

  • The out of the box configuration should be fast and memory efficient for iterative development. It should be suitable for deployment in a production environment. Using the same configuration in development and production reduces bugs and makes deployment easier.

  • The library should use lazy loading whenever possible to remain fast, responsive, and memory efficient during development.

  • The library should aggressively index, cache, and use memory maps to be fast, responsive, and memory efficient for production.

  • The library should be able to process data that is too large to fit into a computer’s main memory.

  • The library should be thread-safe and employ memory mapping to reduce duplicated memory resources when multiprocessing.

  • The interface should act as a generic key-vector store and remain agnostic to the underlying models (like word2vec, GloVe, fastText, and ELMo), so that it remains usable for other domains that use vector embeddings, like computer vision Babenko and Lempitsky (2016).

Gensim offers speed-ups for several of its operations, but they are largely accessible only through advanced configuration, for example, by re-exporting a “.bin”, “.txt”, or “.vec” file into Gensim’s own native format so that it can be memory-mapped. Magnitude makes this easier by providing a default configuration and file format that require no extra configuration to make development and production workloads run efficiently out of the box.

4 Getting Started with Magnitude

The system consists of a Python 2.7 and Python 3.x compatible package (accessible through the PyPI index at https://pypi.org/project/pymagnitude/ or on GitHub at https://github.com/plasticityai/magnitude) with utilities for using the “.magnitude” format and converting to it from other popular embedding formats.

4.1 Installation

Installation for Python 2.7 can be performed using the pip command:
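
pip install pymagnitude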

Installation for Python 3.x can be performed using the pip3 command:
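
pip3 install pymagnitude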

4.2 Basic Usage

Here is how to construct the Magnitude object, query for vectors, and compare them:

from pymagnitude import Magnitude

vectors = Magnitude("w2v.magnitude")
k = vectors.query("king")
q = vectors.query("queen")
vectors.similarity(k, q)  # 0.6510958

Magnitude queries return almost instantly and are memory efficient. The library lazily loads vectors directly from disk instead of having to load the entire model into memory. Additionally, Magnitude supports nearest neighbors operations, finding all words that are closer to a key than another key, and analogy solving (optionally with Levy and Goldberg (2014)’s 3CosMul function):

vectors.most_similar(k, topn=5)
# [('king', 1.0), ('kings', 0.71),
#  ('queen', 0.65), ('monarch', 0.64),
#  ('crown_prince', 0.62)]
vectors.most_similar(q, topn=5)
# [('queen', 1.0), ('queens', 0.74),
#  ('princess', 0.71), ('king', 0.65),
#  ('monarch', 0.64)]
vectors.closer_than("queen", "king")
# ['queens', 'princess']
vectors.most_similar(
    positive=["woman", "king"],
    negative=["man"]
)  # queen
vectors.most_similar_cosmul(
    positive=["woman", "king"],
    negative=["man"]
)  # queen

In addition to querying single words, Magnitude also makes it easy to query for multiple words in a single sentence and multiple sentences:

vectors.query("play")
# Returns: a vector for the word
vectors.query(["play", "music"])
# Returns: an array with two vectors
vectors.query([
    ["play", "music"],
    ["turn", "on", "the", "lights"],
])  # Returns: 2D array with vectors

4.3 Advanced Features

OOVs:

Magnitude implements a novel method for handling out-of-vocabulary (OOV) words. OOVs occur frequently in real world data since pre-trained models are often missing slang, colloquialisms, new product names, and misspellings. For example, while uber exists in the Google News word2vec model, uberx and uberxl do not; these products were not available when the Google News corpus was built. Common strategies for representing such words include generating a random unit-length vector for each unknown word or mapping all unknown words to a single token like “UNK” and representing them with the same vector. Neither solution is ideal, as the embeddings will not capture semantic information about the actual word. With Magnitude, OOV words can simply be queried and will be positioned in the vector space close to other OOV words based on their string similarity:

"uberx" in vectors  # False
"uberxl" in vectors  # False
vectors.query("uberx")
# Returns: [ 0.0507, -0.0708, …]
vectors.query("uberxl")
# Returns: [ 0.0473, -0.08237, …]
vectors.similarity("uberx", "uberxl")
# Returns: 0.955

A consequence of generating OOV vectors is that misspellings and typos are also sensibly handled:

"discrimnatory" in vectors  # False
"hiiiiiiiiii" in vectors  # False
vectors.similarity(
    "missispi",
    "mississippi"
)  # Returns: 0.359
vectors.similarity(
    "discrimnatory",
    "discriminatory"
)  # Returns: 0.830
vectors.similarity(
    "hiiiiiiiiii",
    "hi"
)  # Returns: 0.706

The OOV handling is detailed in Section 5.

Concatenation of Multiple Models:

Magnitude makes it easy to concatenate multiple types of vector embeddings to create combined models.

w2v = Magnitude("w2v.magnitude")
gv = Magnitude("glove.50d.magnitude")
vectors = Magnitude(w2v, gv)  # concatenate
vectors.query("cat")
# Returns: 350d NumPy array
# ('cat' from w2v and 'cat' from gv)
vectors.query(("cat", "cats"))
# Returns: 350d NumPy array
# ('cat' from w2v and 'cats' from gv)

Adding Features for Part-of-Speech Tags and Syntax Dependencies to Vectors:

Magnitude can directly turn a set of keys (like a POS tag set) into vectors. Given an approximate upper bound on the number of keys and a namespace, it uses the hashing trick Weinberger et al. (2009) to choose an appropriate dimensionality for the keys’ vectors.

from pymagnitude import FeaturizerMagnitude

pos_vecs = FeaturizerMagnitude(
    100, namespace="POS")
pos_vecs.dim  # 4
# number of dimensions automatically
# determined by Magnitude from 100
pos_vecs.query("NN")
dep_vecs = FeaturizerMagnitude(
    100, namespace="Dep")
dep_vecs.dim  # 4
dep_vecs.query("nsubj")

This can be used with Magnitude’s concatenation feature to combine the vectors for words with the vectors for POS tags or dependency tags. Homonyms show why this may be useful:

vectors = Magnitude(w2v, pos_vecs,
                    dep_vecs)
vectors.query([
    ("Buffalo", "JJ", "amod"),
    ("buffalo", "NNS", "nsubj"),
    ("Buffalo", "JJ", "amod"),
    ("buffalo", "NNS", "nsubj"),
    ("buffalo", "VBP", "rcmod"),
    ("buffalo", "VB", "ROOT"),
    ("Buffalo", "JJ", "amod"),
    ("buffalo", "NNS", "dobj")
])  # array of 8 x (300 + 4 + 4)

Approximate k-NN:

We support approximate similarity search with the most_similar_approx function. This finds approximate nearest neighbors more quickly than the exact nearest neighbors search performed by the most_similar function. The method accepts an effort argument that takes values in the range [0.0, 1.0]. A lower effort reduces accuracy but increases speed; a higher effort does the reverse. This trade-off works by searching more or fewer of the trees in the underlying index. Our approximate k-NN is powered by Annoy, an open source library released by Spotify. Table 2 compares the speed of various configurations for similarity search.

Metric                               Speed
Exact k-NN                           0.9155s
Approx. k-NN (k=10, effort = 1.0)    0.1873s
Approx. k-NN (k=10, effort = 0.1)    0.0199s
Table 2: Approximate nearest neighbors significantly speeds up similarity searches compared to exact search. Reducing the amount of allowed effort further speeds the approximate k-NN search.
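
As a usage sketch (assuming a Heavy-format model loaded as vectors, since approximate search requires the Annoy index):

vectors.most_similar_approx("queen", topn=10)
vectors.most_similar_approx("queen", topn=10, effort=0.1)  # faster, less accurate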

5 Details of OOV Handling

Facebook’s fastText Bojanowski et al. (2016) provides similar OOV functionality to Magnitude’s. Magnitude allows for OOV lookups for any embedding model, including older models like word2vec and GloVe Mikolov et al. (2013); Pennington et al. (2014), which did not provide OOV support. Magnitude’s OOV method can be used with existing embeddings because it does not require any changes to be made at training time like fastText’s method does. For ELMo vectors, Magnitude will use ELMo’s OOV method.

Constructing vectors from character n-grams:

We generate a vector for an OOV word based on the character n-gram sequences in the word. First, we pad the word with a boundary character at the beginning and at the end. Next, we generate the set of all character n-grams in the padded word of lengths 3 to 6, following Bojanowski et al. (2016), although these lengths are tunable arguments in the Magnitude converter. We use this set of character n-grams to construct a vector, with the same number of dimensions as the in-vocabulary vectors, to represent the OOV word. Each unique character n-gram contributes to the vector through a pseudorandom vector generator function (PRVG). Finally, the vector is normalized to unit length.

PRVG’s random number generator is seeded by a given seed value and generates uniformly random vectors of the model’s dimensionality, with values in the range -1 to 1. The hashing function H produces a 32-bit hash of its input using xxHash, and the hash of each character n-gram serves as the seed for PRVG. Since the seeds are conditioned only upon the word, the output is deterministic across different machines.
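
A minimal sketch of this construction (an illustration, not the library’s exact implementation; the helper names, the padding character, and the use of the xxhash package are assumptions):

import numpy as np
import xxhash  # assumed Python binding for the xxHash algorithm

def character_ngrams(word, n_min=3, n_max=6):
    # Pad the word with a boundary character, then collect all
    # character n-grams of length n_min to n_max.
    padded = "\x00" + word + "\x00"
    return {padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)}

def prvg(seed, d):
    # Pseudorandom vector generator: uniform values in [-1, 1],
    # deterministic for a given seed.
    return np.random.RandomState(seed).uniform(-1.0, 1.0, size=d)

def oov_ngram_vector(word, d=300):
    # Each unique character n-gram contributes a pseudorandom vector
    # seeded by its 32-bit xxHash; the sum is normalized to unit length.
    vec = np.zeros(d)
    for gram in character_ngrams(word):
        vec += prvg(xxhash.xxh32(gram.encode("utf-8")).intdigest(), d)
    return vec / np.linalg.norm(vec)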

This character n-gram-based method will generate highly similar vectors for a pair of OOVs with similar spellings, like uberx and uberxl. However, they will not be embedded close to similar in-vocabulary words like uber.

Interpolation with in-vocabulary words:

To handle matching OOVs to in-vocabulary words, we first define a matching function that returns the normalized mean of the vectors of the top k most string-similar in-vocabulary words, found using the full-text SQLite index. In practice, we use the top 3 most string-similar words. These are then used to interpolate the values of the vector representing the OOV word: 30% of the weight for each value comes from the pseudorandom vector generator based on the OOV word’s character n-grams, and the remaining 70% comes from the normalized mean of the vectors of the 3 most string-similar in-vocabulary words.
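
Building on the sketch above, the interpolation step might look roughly as follows (top_string_similar is a hypothetical stand-in for the library’s full-text n-gram lookup, and the final unit-normalization is an assumption):

def oov_interpolated_vector(word, vectors, d=300, topk=3):
    # Normalized mean of the vectors of the top-k most string-similar
    # in-vocabulary words (lookup helper is hypothetical).
    matches = top_string_similar(word, vectors, topk)
    mean_vec = np.mean([vectors.query(m) for m in matches], axis=0)
    mean_vec = mean_vec / np.linalg.norm(mean_vec)
    # 30% from the n-gram-based pseudorandom vector, 70% from the mean.
    blended = 0.3 * oov_ngram_vector(word, d) + 0.7 * mean_vec
    return blended / np.linalg.norm(blended)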

Morphology-aware matching:

For English, we have implemented a nuanced string similarity metric that is prefix- and suffix-aware. While uberification has a higher string similarity to verification than to uber, good OOV vectors should weight stems more heavily than suffixes. Details of our morphology-aware matching are omitted for space.

Other matching nuances:

We employ other techniques when computing the string similarity metric, such as shrinking repeated sequences of three or more of the same character down to two (hiiiiiiii → hii), ranking strings of a similar length higher, and, for shorter words, ranking strings that share the same first or last character higher.
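
For example, the repeated-character shrinking can be illustrated with a one-line regular expression (an illustration only, not the library’s exact code):

import re
re.sub(r"(.)\1{2,}", r"\1\1", "hiiiiiiii")  # 'hii'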

6 File Format

To provide efficiency at runtime, Magnitude uses a custom “.magnitude” file format instead of “.bin”, “.txt”, “.vec”, or “.hdf5” that word2vec, GloVe, fastText, and ELMo use Mikolov et al. (2013); Pennington et al. (2014); Joulin et al. (2016); Peters et al. (2018). The “.magnitude” file is a SQLite database file. There are 3 variants of the file format: Light, Medium, Heavy. Heavy models have the largest file size but support all of the Magnitude library’s features. Medium models support all features except approximate similarity search. Light models do not support approximate similarity searches or interpolated OOV lookups, but they still support basic OOV lookups. See Figure 1 for more information about the structure and layout of the “.magnitude” format.

[Figure 1 depicts the “.magnitude” SQLite database file and its components: format settings and metadata, keys and unit-length normalized vectors, a SQLite index over keys, character n-grams enumerated for all keys, a SQLite full-text search index over all n-grams, and an LZ4-compressed Annoy mmap index for all vectors. The Light, Medium, and Heavy variants include progressively more of these components.]

Figure 1: Structure of the “.magnitude” file format and its Light, Medium, and Heavy variants.

Converter:

The software includes a command-line converter utility for converting word2vec (“.bin”, “.txt”), GloVe (“.txt”), fastText (“.vec”), or ELMo (“.hdf5”) files to Magnitude files. They can be converted with the command:

python -m pymagnitude.converter
  -i "./vecs.(bin|txt|vec|hdf5)"
  -o "./vecs.magnitude"

The input format will be determined automatically from the extension and the contents of the input file. When the vectors are converted, they will also be unit-length normalized. This conversion process only needs to be completed once per model. After conversion, the Magnitude file is static; it is never modified or written to, which makes concurrent read access safe.

By default, the converter builds a Medium “.magnitude” file. Passing the -s flag turns off the encoding of subword information and results in a Light file. Passing the -a flag turns on building the Annoy approximate-similarity index and results in a Heavy file. Refer to the documentation (https://github.com/plasticityai/magnitude#file-format-and-converter) for more information about conversion configuration options.
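
For instance (hypothetical invocations combining these flags with the converter command above):

python -m pymagnitude.converter -i "./vecs.vec" -o "./vecs_light.magnitude" -s
python -m pymagnitude.converter -i "./vecs.vec" -o "./vecs_heavy.magnitude" -a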

Quantization:

The converter utility accepts a -p <PRECISION> flag to specify the decimal precision to retain. Since the underlying values are stored as integers instead of floats, this is essentially quantization (https://www.tensorflow.org/performance/quantization) for smaller model footprints. Lower decimal precision creates smaller files, because SQLite can store integers with either 1, 2, 3, 4, 6, or 8 bytes (https://www.sqlite.org/datatype3.html). Regardless of the precision selected, the library will create numpy.float32 vectors. The datatype can be changed by passing dtype=numpy.float16 to the Magnitude constructor.
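
For example, a model could be converted with two decimal places of precision and then loaded with half-precision vectors (a hypothetical invocation of the flag and constructor option described above):

python -m pymagnitude.converter -i "./vecs.bin" -o "./vecs.magnitude" -p 2

import numpy
from pymagnitude import Magnitude
vectors = Magnitude("./vecs.magnitude", dtype=numpy.float16)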

7 Conclusion

Magnitude is a new open source Python library and file format for vector embeddings. It makes it easy to integrate embeddings into applications and provides a single interface and configuration that is suitable for both development and production workloads. The library and file format also enable novel features like OOV handling that make models more robust to noisy data. The simple interface, ease of use, and speed of the library compared to other utilities like Gensim make it accessible to NLP beginners and to students in educational settings, such as university NLP and AI courses.

Pre-trained word embeddings have been widely adopted in NLP. Researchers in computer vision have started using pre-trained vector embedding models like Deep1B Babenko and Lempitsky (2016) for images. The Magnitude library intends to stay agnostic to various domains, instead providing a generic key-vector store and interface that is useful for all domains and for research that crosses the boundaries between NLP and vision Hewitt et al. (2018).

8 Software and Data

We release the Magnitude package under the permissive MIT open source license. The full source code and pre-converted “.magnitude” models are on GitHub. The full documentation for all classes, methods, and configurations of the library can be found at https://github.com/plasticityai/magnitude, along with example usage and tutorials.

We have pre-converted several popular embedding models (Google News word2vec, Stanford GloVe, Facebook fastText, AI2 ELMo) to “.magnitude” in all its variants (Light, Medium, and Heavy). You can download them from https://github.com/plasticityai/magnitude#pre-converted-magnitude-formats-of-popular-embeddings-models.

Acknowledgments

We would like to thank Erik Bernhardsson for the useful feedback on integrating Annoy indexing into Magnitude and thank the numerous contributors who have opened issues, reported bugs, or suggested technical enhancements for Magnitude on GitHub.

This material is funded in part by DARPA under grant number HR0011-15-C-0115 (the LORELEI program) and by NSF SBIR Award #IIP-1820240. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA, the NSF, or the U.S. Government. This work has also been supported by the French National Research Agency under project ANR-16-CE33-0013.

References

Appendix A Benchmark Comparisons

All benchmarks (https://github.com/plasticityai/magnitude/blob/master/tests/benchmark.py) were performed with the Google News pre-trained word vectors, “GoogleNews-vectors-negative300.bin” Mikolov et al. (2013), for Gensim and with “GoogleNews-vectors-negative300.magnitude” (http://magnitude.plasticity.ai/word2vec+approx/GoogleNews-vectors-negative300.magnitude) for Magnitude, on a MacBook Pro (Retina, 15-inch, Mid 2014) with a 2.2GHz quad-core Intel Core i7 and 16GB of RAM, on an SSD, averaged over repeated trials where feasible. We explicitly do not use Gensim’s memory-mapped native format, as it requires extra configuration from the developer and is not provided out of the box in Gensim’s data repository (https://github.com/RaRe-Technologies/gensim-data).

Metric | Gensim (Řehůřek and Sojka, 2010) | Magnitude Light | Magnitude Medium | Magnitude Heavy
Initial load time | 70.26s | 0.7210s | — | —
Cold single key query | 0.0001s | 0.0001s | — | —
Warm single key query (same key as cold query) | 0.0044s | 0.00004s | — | —
Cold multiple key query (n=25) | 3.0050s | 0.0442s | — | —
Warm multiple key query (n=25) (same keys as cold query) | 0.0001s | 0.00004s | — | —
First most_similar search query (n=10) (worst case) | 18.493s | 247.05s | — | —
First most_similar search query (n=10) (average case) (w/ disk persistent cache) | 18.917s | 1.8217s | — | —
Subsequent most_similar search (n=10) (different key than first query) | 0.2546s | 0.2434s | — | —
Warm subsequent most_similar search (n=10) (same key as first query) | 0.2374s | 0.00004s | 0.00004s | 0.00004s
First most_similar_approx search query (n=10, effort=1.0) (worst case) | N/A* | N/A | N/A | 29.610s
First most_similar_approx search query (n=10, effort=1.0) (average case) (w/ disk persistent cache) | N/A | N/A | N/A | 0.9155s
Subsequent most_similar_approx search (n=10, effort=1.0) (different key than first query) | N/A | N/A | N/A | 0.1873s
Subsequent most_similar_approx search (n=10, effort=0.1) (different key than first query) | N/A | N/A | N/A | 0.0199s
Warm subsequent most_similar_approx search (n=10, effort=1.0) (same key as first query) | N/A | N/A | N/A | 0.00004s
File size | 3.64GB | 4.21GB | 5.29GB | 10.74GB
Process memory (RAM) utilization | 4.875GB | 18KB | — | —
Process memory (RAM) utilization after 100 key queries | 4.875GB | 168KB | — | —
Process memory (RAM) utilization after 100 key queries + similarity search | 8.228GB† | 342KB‡ | — | —

A dash (—) denotes the same value as the previous column.
* Gensim does support approximate similarity search, but not out of the box: the index must first be built manually with gensim.similarities.index, which is a slow operation.
† Gensim has an option to not duplicate unit-normalized vectors in memory, but it still requires up to 8GB of memory allocation while processing before dropping down to half that, and the option is not on by default.
‡ Magnitude uses mmap to read from disk, so the OS will still allocate pages of memory in its file cache when memory is available, but these pages can be shared between processes and are not managed within each process, which is a performance win for extremely large files.
Table 3: Benchmark comparisons between Gensim, Magnitude Light, Magnitude Medium, and Magnitude Heavy.